Toward a realistic model of speech processing in the brain with self-supervised learning

Neural Information Processing Systems

Several deep neural networks have recently been shown to generate activations similar to those of the brain in response to the same input. These algorithms, however, remain largely implausible: they require (1) extraordinarily large amounts of data, (2) unobtainable supervised labels, (3) textual rather than raw sensory input, and/or (4) implausibly large memory (e.g.


Spoken Conversational Agents with Large Language Models

Yang, Chao-Han Huck, Stolcke, Andreas, Heck, Larry

arXiv.org Artificial Intelligence

Building on this, we examine joint text-speech pre-training methods (Chiu et al., 2022; Barrault et al., 2023; Chen et al., 2022) and provide a comprehensive look at state-of-the-art voice-interfaced LLMs (Reid et al., 2024; Chu et al.). Current work in AI virtual assistants builds upon the voice-only systems of the last decade by leveraging LLMs to significantly improve the coverage and robustness of the spoken language understanding and dialogue state tracking components, alongside substantial advances in spoken language generation. The survey highlights recent advancements in multi-turn dialogue systems, encompassing both LLM-based open-domain dialogue (ODD) and task-oriented dialogue (TOD) systems, as well as relevant datasets and evaluation metrics.



WavShape: Information-Theoretic Speech Representation Learning for Fair and Privacy-Aware Audio Processing

Baser, Oguzhan, Tanriverdi, Ahmet Ege, Kale, Kaan, Chinchali, Sandeep P., Vishwanath, Sriram

arXiv.org Artificial Intelligence

Speech embeddings often retain sensitive attributes such as speaker identity, accent, or demographic information, posing risks in biased model training and privacy leakage. We propose WavShape, an information-theoretic speech representation learning framework that optimizes embeddings for fairness and privacy while preserving task-relevant information. We leverage mutual information (MI) estimation using the Donsker-Varadhan formulation to guide an MI-based encoder that systematically filters sensitive attributes while maintaining speech content essential for downstream tasks. Experimental results on three known datasets show that WavShape reduces MI between embeddings and sensitive attributes by up to 81% while retaining 97% of task-relevant information.
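As a concrete illustration of the kind of estimator the abstract refers to, the sketch below implements the Donsker-Varadhan lower bound on mutual information with a small critic network, in the style of MINE. It is a minimal, assumption-laden example rather than WavShape's code: the Critic class, the dv_mi_lower_bound helper, and all dimensions are hypothetical.

import torch
import torch.nn as nn

class Critic(nn.Module):
    # Scores (embedding, attribute) pairs; trained so true joint pairs score higher.
    def __init__(self, emb_dim, attr_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(emb_dim + attr_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, z, a):
        return self.net(torch.cat([z, a], dim=-1)).squeeze(-1)

def dv_mi_lower_bound(critic, z, a):
    # Donsker-Varadhan bound: I(Z; A) >= E_joint[T(z, a)] - log E_marg[exp(T(z, a))].
    # z: speech embeddings (B, emb_dim); a: sensitive-attribute features (B, attr_dim).
    t_joint = critic(z, a)                      # critic scores under the joint p(z, a)
    a_shuffled = a[torch.randperm(a.size(0))]   # shuffling breaks pairing -> product of marginals
    t_marg = critic(z, a_shuffled)
    n = torch.tensor(float(t_marg.numel()))
    log_mean_exp = torch.logsumexp(t_marg, dim=0) - torch.log(n)
    return t_joint.mean() - log_mean_exp

Maximizing this bound over the critic yields an MI estimate; an encoder can then be trained to push the estimate down for sensitive attributes while a task loss preserves useful content, which is the general recipe the abstract describes.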




The NaijaVoices Dataset: Cultivating Large-Scale, High-Quality, Culturally-Rich Speech Data for African Languages

Emezue, Chris, Community, NaijaVoices, Awobade, Busayo, Owodunni, Abraham, Emezue, Handel, Emezue, Gloria Monica Tobechukwu, Emezue, Nefertiti Nneoma, Ogun, Sewade, Akinremi, Bunmi, Adelani, David Ifeoluwa, Pal, Chris

arXiv.org Artificial Intelligence

The development of high-performing, robust, and reliable speech technologies depends on large, high-quality datasets. However, African languages -- including our focus, Igbo, Hausa, and Yoruba -- remain under-represented due to insufficient data. Popular voice-enabled technologies do not support any of the 2000+ African languages, limiting accessibility for roughly one billion people. While previous dataset efforts exist for the target languages, they lack the scale and diversity needed for robust speech models. To bridge this gap, we introduce the NaijaVoices dataset, a 1,800-hour speech-text dataset with 5,000+ speakers. We outline our unique data collection approach, analyze its acoustic diversity, and demonstrate its impact through finetuning experiments on automatic speech recognition, achieving average WER improvements of 75.86% (Whisper), 52.06% (MMS), and 42.33% (XLSR). These results highlight NaijaVoices' potential to advance multilingual speech processing for African languages.
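For readers unfamiliar with the kind of finetuning experiment mentioned above, the sketch below adapts a pretrained Whisper checkpoint to new (audio, transcript) pairs with Hugging Face Transformers. It is a generic, assumption-level sketch and not the NaijaVoices training recipe; the checkpoint name, learning rate, and single-example step are placeholders.

import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def training_step(audio_array, transcript, sampling_rate=16000):
    # One gradient step on a single (audio, text) pair; real runs would batch and pad.
    inputs = processor(audio_array, sampling_rate=sampling_rate, return_tensors="pt")
    labels = processor.tokenizer(transcript, return_tensors="pt").input_ids
    out = model(input_features=inputs.input_features, labels=labels)  # returns cross-entropy loss
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return out.loss.item()

WER would then be measured on held-out speech before and after finetuning to quantify the kind of improvements reported in the abstract.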


Brain-tuned Speech Models Better Reflect Speech Processing Stages in the Brain

Moussa, Omer, Toneva, Mariya

arXiv.org Artificial Intelligence

Pretrained self-supervised speech models excel in speech tasks but do not reflect the hierarchy of human speech processing, as they encode rich semantics in middle layers and poor semantics in late layers. Recent work showed that brain-tuning (fine-tuning models using human brain recordings) improves speech models' semantic understanding. Here, we examine how well brain-tuned models further reflect the brain's intermediate stages of speech processing. We find that late layers of brain-tuned models substantially improve over pretrained models in their alignment with semantic language regions. Further layer-wise probing reveals that early layers remain dedicated to low-level acoustic features, while late layers become the best at complex high-level tasks. These findings show that brain-tuned models not only perform better but also exhibit well-defined hierarchical processing, progressing from acoustic to semantic representations, making them better model organisms for human speech processing.
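The layer-wise probing mentioned above can be pictured with a short, generic sketch: extract each layer's hidden states from a pretrained speech model, fit a simple linear probe per layer, and compare probe scores across depth. This is an illustrative assumption only; the model choice (wav2vec2), the ridge probe, and the layer_probe_scores helper are not from the paper, whose brain-tuning and alignment analyses go well beyond this.

import numpy as np
import torch
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
model.eval()

def layer_probe_scores(waveforms, targets, sampling_rate=16000):
    # Probe each layer's mean-pooled features against one target value per utterance.
    per_layer_feats = None
    with torch.no_grad():
        for wav in waveforms:
            inputs = extractor(wav, sampling_rate=sampling_rate, return_tensors="pt")
            hidden = model(inputs.input_values, output_hidden_states=True).hidden_states
            pooled = [h.mean(dim=1).squeeze(0).numpy() for h in hidden]  # one vector per layer
            if per_layer_feats is None:
                per_layer_feats = [[] for _ in pooled]
            for layer, vec in enumerate(pooled):
                per_layer_feats[layer].append(vec)
    scores = []
    for feats in per_layer_feats:
        X, y = np.stack(feats), np.asarray(targets)
        scores.append(cross_val_score(Ridge(alpha=1.0), X, y, cv=3).mean())
    return scores  # one probe score per layer; higher = target better decodable there

The function assumes at least three utterances (for the 3-fold cross-validation) and a continuous target per utterance, e.g. an acoustic or semantic feature to be decoded.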


The Multimodal Information Based Speech Processing (MISP) 2025 Challenge: Audio-Visual Diarization and Recognition

Gao, Ming, Wu, Shilong, Chen, Hang, Du, Jun, Lee, Chin-Hui, Watanabe, Shinji, Chen, Jingdong, Siniscalchi, Sabato Marco, Scharenborg, Odette

arXiv.org Artificial Intelligence

Meetings are a valuable yet challenging scenario for speech applications due to complex acoustic conditions. This paper summarizes the outcomes of the MISP 2025 Challenge, hosted at Interspeech 2025, which focuses on multi-modal, multi-device meeting transcription by incorporating the video modality alongside audio. The tasks include Audio-Visual Speaker Diarization (AVSD), Audio-Visual Speech Recognition (AVSR), and Audio-Visual Diarization and Recognition (AVDR). We present the challenge's objectives, tasks, dataset, baseline systems, and solutions proposed by participants. The best-performing systems achieved significant improvements over the baseline: the top AVSD model achieved a Diarization Error Rate (DER) of 8.09%, improving by 7.43%; the top AVSR system achieved a Character Error Rate (CER) of 9.48%, improving by 10.62%; and the best AVDR system achieved a concatenated minimum-permutation Character Error Rate (cpCER) of 11.56%, improving by 72.49%.
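As a pointer to how the error rates quoted above are defined, the small example below computes a character error rate as Levenshtein edit distance over reference length; DER and cpCER follow the same error-over-reference principle, with time-based and permutation-based accounting respectively. This is an illustrative sketch, not the challenge's official scoring code.

def edit_distance(ref, hyp):
    # Levenshtein distance via dynamic programming over a single rolling row.
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, start=1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,          # deletion
                        dp[j - 1] + 1,      # insertion
                        prev + (r != h))    # substitution (or match)
            prev = cur
    return dp[-1]

def cer(reference, hypothesis):
    # Character error rate = character edits / reference characters.
    return edit_distance(reference, hypothesis) / max(len(reference), 1)

print(round(cer("misp challenge", "misp chalenge"), 3))  # 1 error / 14 reference chars ~ 0.071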